Abstract. This paper presents a general coding method where data in
a Hilbert space are represented by finite dimensional coding vectors. The
method is based on empirical risk minimization within a certain class of
linear operators, which map the set of coding vectors to the Hilbert space.
Two results bounding the expected reconstruction error of the method
are derived, which highlight the role played by the codebook and the
class of linear operators. The results are specialized to some cases of
practical importance, including $K$-means clustering, nonnegative matrix
factorization and other sparse coding methods.



Index Terms: Empirical risk minimization, estimation bounds, $K$-means
clustering and vector quantization, statistical learning.

1 Introduction 

We study a general class of $K$-dimensional coding methods for data drawn from
a distribution $\mu$ on the unit ball of a Hilbert space $H$. These methods encode a
data point $x \sim \mu$ as a vector $\hat{y} \in \mathbb{R}^K$, according to the formula

$$\hat{y} = \arg\min_{y \in Y} \|x - Ty\|^2,$$

where $Y \subseteq \mathbb{R}^K$ is a prescribed set of codes (called the codebook), which we can
always assume to span $\mathbb{R}^K$, and $T : \mathbb{R}^K \to H$ is a linear map, which defines a
particular implementation of the codebook. It embeds the codebook $Y$ in $H$ and
yields the set $T(Y)$ of exactly codable patterns. If $\hat{y}$ is the code found for $x$ then
$\hat{x} = T\hat{y}$ is the reconstructed data point. The quantity

$$f_T(x) = \min_{y \in Y} \|x - Ty\|^2$$

is called the reconstruction error.
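The encoding rule and the reconstruction error can be illustrated with a minimal sketch. It assumes a finite codebook and a finite-dimensional space $H = \mathbb{R}^d$, with $T$ stored as the list of its columns $Te_1, \dots, Te_K$; the names `apply_map`, `sq_dist` and `encode` are hypothetical and do not come from the paper.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def apply_map(columns: Sequence[Vector], y: Sequence[float]) -> Vector:
    """Return Ty, where columns[k] holds the image Te_k of the k-th basis vector."""
    dim = len(columns[0])
    return [sum(y[k] * columns[k][j] for k in range(len(columns))) for j in range(dim)]


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def encode(x: Vector, columns: Sequence[Vector],
           codebook: Sequence[Vector]) -> Tuple[Vector, float]:
    """Exhaustive search over a finite codebook (an assumption of this sketch).

    Returns the code minimizing ||x - Ty||^2 together with the
    reconstruction error f_T(x).
    """
    best = min(codebook, key=lambda y: sq_dist(x, apply_map(columns, y)))
    return list(best), sq_dist(x, apply_map(columns, best))
```

For an infinite codebook (a ball or a cone, as in the examples of Section 2) the exhaustive search would have to be replaced by a suitable optimization routine.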
Given a codebook $Y$ and a finite number of independent observations $x_1, \dots, x_m \sim \mu$,
a common sense approach searches for an implementation $\hat{T}$ which is optimal
on average over the observed points, that is

$$\hat{T} = \arg\min_{T \in \mathcal{T}} \frac{1}{m} \sum_{i=1}^m f_T(x_i), \qquad (1)$$

where $\mathcal{T}$ denotes some class of linear maps $T : \mathbb{R}^K \to H$. As we shall see, this
framework is general enough to include principal component analysis, $K$-means
clustering, non-negative matrix factorization [10] and the sparse coding method
as proposed in [14].

Whenever the codebook Y is compact and T is bounded in the operator 
norm this approach is justified by the following high-probability, uniform bound 
on the expected reconstruction error. 

Theorem 1. Suppose that $Y$ is a closed subset of the unit ball of $\mathbb{R}^K$, that there
is $c \ge 1$ such that $\|T\|_\infty \le c$ for all $T \in \mathcal{T}$ and that $\delta \in (0,1)$. Then with
probability at least $1 - \delta$ in the observed data $x_1, \dots, x_m \sim \mu$ we have for every
$T \in \mathcal{T}$ that



The bound is two-sided in the sense that also with probability at least $1 - \delta$ we
have for every $T \in \mathcal{T}$ that



Any compact subset of $\mathbb{R}^K$ can of course be down-scaled to be contained in
the unit ball, and the scaling factor can be absorbed in $c$, so that the above
result is applicable to any compact codebook.

The theorem implies a bound on the excess risk: let $T_0 \in \mathcal{T}$ be a minimizer of
the expected reconstruction error within the set $\mathcal{T}$. It follows from the definition
of the empirical minimizer $\hat{T}$ and the above result that the expected reconstruction
error of $\hat{T}$ is with high probability not more than $O(1/\sqrt{m})$ worse than that of $T_0$.
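The excess-risk claim can be spelled out in one line. The following is a sketch in which $\hat{T}$ is the empirical minimizer, $\epsilon$ is shorthand (not the paper's notation) for the right-hand side of the bound in Theorem 1, and both directions of the two-sided bound are assumed to hold simultaneously, which by a union bound happens with probability at least $1 - 2\delta$.

```latex
\begin{align*}
\mathbb{E} f_{\hat{T}} - \mathbb{E} f_{T_0}
 &= \Big(\mathbb{E} f_{\hat{T}} - \tfrac{1}{m}\sum_{i=1}^m f_{\hat{T}}(x_i)\Big)
  + \Big(\tfrac{1}{m}\sum_{i=1}^m f_{\hat{T}}(x_i) - \tfrac{1}{m}\sum_{i=1}^m f_{T_0}(x_i)\Big)
  + \Big(\tfrac{1}{m}\sum_{i=1}^m f_{T_0}(x_i) - \mathbb{E} f_{T_0}\Big) \\
 &\le \epsilon + 0 + \epsilon = 2\epsilon = O(1/\sqrt{m}).
\end{align*}
```

The middle term is nonpositive because $\hat{T}$ minimizes the empirical average over $\mathcal{T}$.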

This order in $m$ is optimal, as we know from existing lower bounds for $K$-means
clustering [3]. The above dependence on $K$ is, however, generally not
optimal, and can be considerably improved with a more careful analysis, if we
are prepared to accept the slightly inferior rate of $\sqrt{\ln m / m}$ in the sample size.
To state this improvement define

$$\|\mathcal{T}\|_Y = \sup_{T \in \mathcal{T}} \|T\|_Y = \sup_{T \in \mathcal{T}} \sup_{y \in Y} \|Ty\|.$$

We then have the following result.

Theorem 2. Assume that $\|\mathcal{T}\|_Y \ge 1$ and that the functions $f_T$ for $T \in \mathcal{T}$, when
restricted to the unit ball of $H$, have range contained in $[0, b]$. Fix $\delta > 0$.

Then with probability at least $1 - \delta$ in the observed data $x_1, \dots, x_m \sim \mu$ we
have for every $T \in \mathcal{T}$ that

$$\mathbb{E}_{x \sim \mu} f_T(x) - \frac{1}{m} \sum_{i=1}^m f_T(x_i) \le \frac{K}{\sqrt{m}} \left( 14 \|\mathcal{T}\|_Y + \frac{b}{2} \sqrt{\ln\left(16 m \|\mathcal{T}\|_Y^2\right)} \right) + b \sqrt{\frac{\ln(1/\delta)}{2m}}.$$

The bound is two sided in the same sense as the previous result.
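To get a feeling for the magnitude of the bound, the following sketch evaluates its right-hand side numerically. It assumes the bound has the form $\frac{K}{\sqrt{m}}\big(14\|\mathcal{T}\|_Y + \frac{b}{2}\sqrt{\ln(16 m \|\mathcal{T}\|_Y^2)}\big) + b\sqrt{\ln(1/\delta)/(2m)}$; the function name `theorem2_slack` and its parameter names are hypothetical.

```python
import math


def theorem2_slack(K: int, m: int, norm_TY: float, b: float, delta: float) -> float:
    """Right-hand side of the Theorem 2 bound (assumed form, see the lead-in).

    K       -- code dimension
    m       -- sample size
    norm_TY -- the quantity ||T||_Y, assumed to be at least 1
    b       -- upper bound on the range of the reconstruction errors
    delta   -- failure probability in (0, 1)
    """
    if norm_TY < 1:
        raise ValueError("Theorem 2 assumes ||T||_Y >= 1")
    if not 0 < delta < 1:
        raise ValueError("delta must lie in (0, 1)")
    complexity = (K / math.sqrt(m)) * (
        14 * norm_TY + (b / 2) * math.sqrt(math.log(16 * m * norm_TY ** 2))
    )
    confidence = b * math.sqrt(math.log(1 / delta) / (2 * m))
    return complexity + confidence
```

Because of the constant 14, the bound only becomes nontrivial (smaller than $b$) once $m$ is large compared with $K^2 \|\mathcal{T}\|_Y^2$.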

Both results immediately imply uniform convergence in probability. We are
not aware of other results for nonnegative matrix factorization [10] or the sparse
coding techniques as in [14].

Before proving our results, we will illustrate their implications in some cases
of interest. It turns out that the dependence on $K$ in Theorem 2 adapts to the
specific situation under consideration.

A preliminary version of this paper appeared in the proceedings of the 2008
Algorithmic Learning Theory Conference [12]. The new version contains
Theorem 1 and a simplified proof of Theorem 2 with improved constants.

2 Examples of coding schemes 

Several coding schemes can be expressed in our framework. We describe some of 
these methods and how our result applies. 

2.1 Principal component analysis 

Principal component analysis (PCA) seeks a $K$-dimensional orthogonal projection
which maximizes the projected variance and then uses this projection to
encode future data. A projection $P$ can be expressed as $TT^*$ where $T$ is an
isometry which maps $\mathbb{R}^K$ to the range of $P$. Since

$$\|Px\|^2 = \|x\|^2 - \|x - Px\|^2 = \|x\|^2 - \min_{y \in \mathbb{R}^K} \|x - Ty\|^2,$$

finding $P$ to maximize the true or empirical expectation of $\|Px\|^2$ is equivalent to
finding $T$ to minimize the corresponding expectation of $\min_{y \in \mathbb{R}^K} \|x - Ty\|^2$. We
see that PCA is described by our framework upon the identifications $Y = \mathbb{R}^K$
and $\mathcal{T}$ is restricted to the class of isometries $T : \mathbb{R}^K \to H$. Given $T \in \mathcal{T}$ and
$x \in H$ the reconstruction error is

$$f_T(x) = \min_{y \in \mathbb{R}^K} \|x - Ty\|^2.$$

If the data are constrained to be in the unit ball of $H$, as we generally assume,
then it is easily seen that we can take $Y$ to be the unit ball of $\mathbb{R}^K$ without
changing any of the encodings. We can therefore apply Theorem 2 with $\|\mathcal{T}\|_Y = 1$
and $b = 1$. This is beside the point however, because in the simple case of PCA
much better bounds are available (see [15], [19] and Lemma 6 below). In [19] local
Rademacher averages are used to give faster rates under certain circumstances.

An objection to PCA is that generic codes have $K$ nonzero components,
while for practical and theoretical reasons sparse codes with far fewer than $K$
nonzero components may be preferable [14].
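The identity relating the projected norm to the reconstruction error can be checked numerically. The sketch below assumes $H = \mathbb{R}^d$ and represents the isometry $T$ by a list of orthonormal vectors $Te_1, \dots, Te_K$; the helper names are hypothetical.

```python
from typing import List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean inner product."""
    return sum(ai * bi for ai, bi in zip(a, b))


def pca_reconstruction_error(x: Sequence[float],
                             basis: Sequence[Sequence[float]]) -> Tuple[float, List[float]]:
    """Return (min_y ||x - Ty||^2, minimizing code y) for an isometry T.

    `basis` lists the orthonormal vectors Te_k. For an isometry the
    minimizing code is y = T* x, that is y_k = <x, Te_k>.
    """
    y = [dot(x, u) for u in basis]
    recon = [sum(y[k] * basis[k][j] for k in range(len(basis))) for j in range(len(x))]
    err = sum((xj - rj) ** 2 for xj, rj in zip(x, recon))
    return err, y


# Check ||Px||^2 = ||x||^2 - min_y ||x - Ty||^2 on a small example.
x = [0.6, 0.0, 0.8]
basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
err, y = pca_reconstruction_error(x, basis)
projected_sq_norm = dot(y, y)  # ||Px||^2 = ||T* x||^2 because T is an isometry
assert abs(projected_sq_norm - (dot(x, x) - err)) < 1e-12
```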



2.2 $K$-means clustering or vector quantization

Here $Y = \{e_1, \dots, e_K\}$, where the vectors $e_k$ form an orthonormal basis of
$\mathbb{R}^K$. An implementation $T$ now defines a set of centers $\{Te_1, \dots, Te_K\}$, the
reconstruction error is $\min_{k=1}^K \|x - Te_k\|^2$ and a data point $x$ is coded by the
$e_k$ such that $Te_k$ is nearest to $x$. The algorithm (1) becomes

$$\hat{T} = \arg\min_{T \in \mathcal{T}} \frac{1}{m} \sum_{i=1}^m \min_{k=1}^K \|x_i - Te_k\|^2.$$

It is clear that every center $Te_k$ has at most unit norm, so that $\|\mathcal{T}\|_Y = 1$. Since
all data points are in the unit ball we have $\|x - Te_k\|^2 \le 4$ so we can set $b = 4$
and the bound in Theorem 2 becomes

$$\frac{K}{\sqrt{m}} \left( 14 + 2 \sqrt{\ln(16 m)} \right) + \sqrt{\frac{8 \ln(1/\delta)}{m}}.$$

The order of this bound matches up to $\sqrt{\ln m}$ the order given in [?] or [16].
To illustrate our method we will also prove the bound

$$K \sqrt{\frac{18\pi}{m}} + \sqrt{\frac{8 \ln(1/\delta)}{m}}$$

(Theorem 6), which is essentially the same as those in [3] or [16]. There is a
lower bound of order $\sqrt{K/m}$ in [3], and it is unknown which of the two bounds
(upper or lower) is tight.

In $K$-means clustering every code has only one nonzero component, so that
sparsity is enforced in a maximal way. On the other hand this results in a weaker
approximation capability of the coding scheme.
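The empirical objective minimized in $K$-means clustering is straightforward to evaluate for a given set of centers. The following sketch assumes $H = \mathbb{R}^d$ and passes the centers $Te_1, \dots, Te_K$ directly as a list; the function name is hypothetical, and the sketch only evaluates the objective without searching for the minimizing centers.

```python
from typing import Sequence


def kmeans_empirical_risk(points: Sequence[Sequence[float]],
                          centers: Sequence[Sequence[float]]) -> float:
    """Return (1/m) * sum_i min_k ||x_i - Te_k||^2.

    centers[k] plays the role of the center Te_k.
    """
    total = 0.0
    for x in points:
        # Reconstruction error of x: squared distance to the nearest center.
        total += min(sum((xj - cj) ** 2 for xj, cj in zip(x, c)) for c in centers)
    return total / len(points)
```

For example, with the two points `[0.0, 0.0]` and `[1.0, 0.0]`, the single center `[0.5, 0.0]` gives an empirical risk of 0.25, while using both points as centers gives 0.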

2.3 Nonnegative matrix factorization 

Here $Y$ is the positive orthant in $\mathbb{R}^K$, that is the cone

$$Y = \{ y : y = (y_1, \dots, y_K),\ y_k \ge 0,\ 1 \le k \le K \}.$$

A chosen map $T$ generates a cone $T(Y) \subseteq H$ onto which incoming data is
projected. In the original formulation by Lee and Seung [10] it is postulated
that both the data and the vectors $Te_k$ be contained in the positive orthant
of some finite dimensional space, but we can drop most of these restrictions,
keeping only the requirement that $\langle Te_k, Te_l \rangle \ge 0$ for $1 \le k, l \le K$.

No coding will change if we require that $\|Te_k\| = 1$ for all $1 \le k \le K$ by a
suitable normalization. The set $\mathcal{T}$ is then given by

$$\mathcal{T} = \{ T : T \in \mathcal{L}(\mathbb{R}^K, H),\ \|Te_k\| = 1,\ \langle Te_k, Te_l \rangle \ge 0,\ 1 \le k, l \le K \}.$$

We can restrict $Y$ to its intersection with the unit ball in $\mathbb{R}^K$ (see Lemma 2
below). We obtain that $\|\mathcal{T}\|_Y = \sqrt{K}$. Since $0 \in Y$ we also have
$f_T(x) \le \|x\|^2 \le 1$ on the unit ball, so we can take $b = 1$. Hence, Theorem 2 yields the bound

$$\frac{K}{\sqrt{m}} \left( 14 \sqrt{K} + \frac{1}{2} \sqrt{\ln(16 m K)} \right) + \sqrt{\frac{\ln(1/\delta)}{2m}}$$

on the estimation error. We do not know of any other generalization bounds for
this coding scheme.
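The value $\|\mathcal{T}\|_Y = \sqrt{K}$ can be verified directly. The following short derivation is a sketch under the assumptions stated in this section: $y_k \ge 0$, $\|y\| \le 1$ and $\|Te_k\| = 1$.

```latex
\begin{align*}
\|Ty\| &= \Big\| \sum_{k=1}^K y_k\, Te_k \Big\|
        \le \sum_{k=1}^K y_k \|Te_k\|
        = \sum_{k=1}^K y_k
        \le \sqrt{K}\, \|y\|
        \le \sqrt{K}, \\
\|Ty\| &= \sqrt{K}
  \quad \text{when } Te_1 = \dots = Te_K \text{ and } y = K^{-1/2} (1, \dots, 1).
\end{align*}
```

The second line shows that the upper bound is attained by an admissible pair, since identical unit vectors satisfy $\langle Te_k, Te_l \rangle = 1 \ge 0$.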

Nonnegative matrix factorization appears to encourage sparsity, but cases 
have been reported where sparsity was not observed [11]. In fact this undesir- 
able behavior should be generic for exactly codable data. Various authors have 
therefore proposed additional constraints ([H], [7]). It is clear that additional 
constraints on $\mathcal{T}$ can only improve estimation and that the passage from $Y$ to a
subset can only improve our bounds, because the quantity $\|\mathcal{T}\|_Y$ would decrease.



2.4 Sparse coding 

Another method arises by choosing the $\ell_p$-unit ball as a codebook. Let
$Y = \{ y : y \in \mathbb{R}^K, \|y\|_p \le 1 \}$ and $\mathcal{T} = \{ T : \mathbb{R}^K \to H : \|Te_k\| \le 1,\ 1 \le k \le K \}$.
With $q$ the conjugate exponent, $1/p + 1/q = 1$, Hölder's inequality gives

$$\|Ty\| = \Big\| \sum_{k=1}^K y_k Te_k \Big\| \le \sum_{k=1}^K |y_k| \|Te_k\| \le \Big( \sum_{k=1}^K \|Te_k\|^q \Big)^{1/q} \le K^{1/q} = K^{1 - 1/p},$$

implying that $\|\mathcal{T}\|_Y \le K^{1 - 1/p}$.

By the same argument as above all the $f_T$ have range contained in $[0, 1]$, so
Theorem 2 can be applied with $b = 1$ to yield the bound

$$\frac{K}{\sqrt{m}} \left( 14 K^{1 - 1/p} + \frac{1}{2} \sqrt{\ln\left(16 m K^{2 - 2/p}\right)} \right) + \sqrt{\frac{\ln(1/\delta)}{2m}}$$

on the estimation error. The best bound is obtained when $p = 1$, and the order
in $K$ matches that of the bound for $K$-means clustering described earlier.

The method for $p = 1$ is similar to the sparse-coding method proposed by
Olshausen and Field [14], with the difference that the term $\|y\|_1$ is used as a
penalty term instead of the hard constraint $\|y\|_1 \le 1$. The method of Olshausen
and Field [14] approximates with a compromise of geometric proximity and sparsity
and our result asserts that the observed value of this compromise generalizes
to unseen data if enough data have been observed.
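The effect of the exponent $p$ on the codebook-dependent factor can be made concrete with a small sketch. It assumes the bound $\|\mathcal{T}\|_Y \le K^{1 - 1/p}$ for $p \ge 1$; the function name `codebook_norm_bound` is hypothetical.

```python
def codebook_norm_bound(K: int, p: float) -> float:
    """Upper bound K^(1 - 1/p) on ||T||_Y for the l_p unit ball codebook, p >= 1."""
    if p < 1:
        raise ValueError("the l_p ball is considered here only for p >= 1")
    return K ** (1.0 - 1.0 / p)


# The factor grows with p: it equals 1 for p = 1 and sqrt(K) for p = 2.
for p in (1.0, 1.5, 2.0):
    print(p, codebook_norm_bound(16, p))
```

This makes explicit why $p = 1$ gives the best bound: the factor is then independent of $K$, as in $K$-means clustering.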



3 Proofs 



We first introduce some notation, conventions and auxiliary results. Then we set
about to prove Theorems 1 and 2.

3.1 Notation, definitions and auxiliary results

Throughout $H$ denotes a Hilbert space. The term norm and the notation $\|\cdot\|$
and $\langle \cdot, \cdot \rangle$ always refer to the Euclidean norm and inner product on $\mathbb{R}^K$ or on $H$.
Other norms are characterized by subscripts. If $H_1$ and $H_2$ are any Hilbert spaces
$\mathcal{L}(H_1, H_2)$ denotes the vector space of bounded linear transformations from $H_1$
to $H_2$. If $H_1 = H_2$ we just write $\mathcal{L}(H_1) = \mathcal{L}(H_1, H_1)$. With $\mathcal{U}(H_1, H_2)$ we denote
the set of isometries in $\mathcal{L}(H_1, H_2)$, that is maps $U$ satisfying $\|Ux\| = \|x\|$
for all $x \in H_1$.

We use $\mathcal{L}_2(H)$ for the set of Hilbert-Schmidt operators on $H$, which becomes
itself a Hilbert space with the inner product $\langle T, S \rangle_2 = \mathrm{tr}(T^* S)$ and the
corresponding (Frobenius) norm $\|\cdot\|_2$.

For $x \in H$ the rank-one operator $Q_x$ is defined by $Q_x z = \langle z, x \rangle x$. For any
$T \in \mathcal{L}_2(H)$ the identity

$$\langle T^* T, Q_x \rangle_2 = \|Tx\|^2$$

is easily verified.
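For completeness, the verification can be written out as follows. This is a sketch assuming a real Hilbert space $H$ with an orthonormal basis $(e_j)$, which is not notation used elsewhere in the text.

```latex
\begin{align*}
\langle T^* T, Q_x \rangle_2
 &= \mathrm{tr}\big( T^* T\, Q_x \big)
  = \sum_j \langle T^* T\, Q_x e_j, e_j \rangle
  = \sum_j \langle e_j, x \rangle \langle T^* T x, e_j \rangle \\
 &= \Big\langle T^* T x, \sum_j \langle e_j, x \rangle e_j \Big\rangle
  = \langle T^* T x, x \rangle
  = \|Tx\|^2 .
\end{align*}
```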

Suppose that $Y \subseteq \mathbb{R}^K$ spans $\mathbb{R}^K$. It is easily verified that the quantity

$$\|T\|_Y = \sup_{y \in Y} \|Ty\|$$

defines a norm on $\mathcal{L}(\mathbb{R}^K, H)$.

We use the following well known result on covering numbers (see, for example,
Proposition 5 in [5]).

Proposition 1. Let $B$ be a ball of radius $r$ in an $N$-dimensional Banach space
and $\epsilon > 0$. There exists a subset $B_\epsilon \subseteq B$ such that $|B_\epsilon| \le (4r/\epsilon)^N$ and $\forall z \in B$,
$\exists z' \in B_\epsilon$ with $d(z, z') \le \epsilon$, where $d$ is the metric of the Banach space.

The following concentration inequality, known as the bounded difference
inequality [13], goes back to the work of Hoeffding [5].

Theorem 3. Let $\mu_i$ be a probability measure on a space $\mathcal{X}_i$, for $i = 1, \dots, m$.
Let $\mathcal{X} = \prod_{i=1}^m \mathcal{X}_i$ and $\mu = \otimes_{i=1}^m \mu_i$ be the product space and product measure
respectively. Suppose the function $\Psi : \mathcal{X} \to \mathbb{R}$ satisfies

$$|\Psi(x) - \Psi(x')| \le c_i$$

whenever $x$ and $x' \in \mathcal{X}$ differ only in the $i$-th coordinate, where $c_1, \dots, c_m$ are
some positive parameters. Then

$$\Pr_{x \sim \mu} \{ \Psi(x) - \mathbb{E} \Psi \ge t \} \le \exp\left( \frac{-2 t^2}{\sum_{i=1}^m c_i^2} \right).$$

Throughout $\sigma_i$ will denote a sequence of mutually independent random variables,
uniformly distributed on $\{-1, 1\}$ and $\gamma_i$, $\gamma_{ij}$ will be (multiple indexed)
sequences of mutually independent Gaussian random variables, with zero mean
and unit standard deviation.

If $\mathcal{F}$ is a class of real-valued functions on a space $\mathcal{X}$ and $\mu$ a probability
measure on $\mathcal{X}$ then for $m \in \mathbb{N}$ the Rademacher and Gaussian complexities of
$\mathcal{F}$ w.r.t. $\mu$ are defined (see [2]) as

$$\mathcal{R}_m(\mathcal{F}, \mu) = \frac{2}{m} \mathbb{E}_{x \sim \mu^m} \mathbb{E}_\sigma \sup_{f \in \mathcal{F}} \sum_{i=1}^m \sigma_i f(x_i),$$

$$\Gamma_m(\mathcal{F}, \mu) = \frac{2}{m} \mathbb{E}_{x \sim \mu^m} \mathbb{E}_\gamma \sup_{f \in \mathcal{F}} \sum_{i=1}^m \gamma_i f(x_i)$$

respectively.

Appropriately scaled Gaussian complexities can be substituted for Rademacher
complexities, by virtue of the next Lemma. For a proof see, for example, [9, p. 97].

Lemma 1. For $Y \subseteq \mathbb{R}^k$ we have $\mathcal{R}(Y) \le \sqrt{\pi/2}\, \Gamma(Y)$.
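For a finite function class and a small fixed sample, the inner expectation in the Rademacher complexity can be computed exactly by enumerating all sign patterns. The sketch below does this for the empirical quantity $\frac{2}{m} \mathbb{E}_\sigma \sup_f \sum_i \sigma_i f(x_i)$, that is without the outer expectation over the sample; the function name and the representation of the class by a table of values are assumptions of this illustration.

```python
from itertools import product
from typing import Sequence


def empirical_rademacher(values: Sequence[Sequence[float]]) -> float:
    """Exact (2/m) * E_sigma sup_f sum_i sigma_i f(x_i) for a finite class.

    values[j][i] is f_j(x_i): the j-th function evaluated at the i-th
    sample point. All 2^m sign patterns are enumerated, so m must be small.
    """
    m = len(values[0])
    total = 0.0
    for signs in product((-1, 1), repeat=m):
        # Supremum over the class of the signed sum for this sign pattern.
        total += max(sum(s * v for s, v in zip(signs, row)) for row in values)
    return 2.0 * total / (m * 2 ** m)
```

A class consisting of a single function has complexity zero, while a class containing a function and its negative does not.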

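The next result is known as Slepian's lemma ([17], [9]).

Theorem 4. Let $\Omega$ and $\Xi$ be mean zero, separable Gaussian processes indexed
by a common set $\mathcal{S}$, such that

$$\mathbb{E} (\Omega_{s_1} - \Omega_{s_2})^2 \le \mathbb{E} (\Xi_{s_1} - \Xi_{s_2})^2 \quad \text{for all } s_1, s_2 \in \mathcal{S}.$$

Then

$$\mathbb{E} \sup_{s \in \mathcal{S}} \Omega_s \le \mathbb{E} \sup_{s \in \mathcal{S}} \Xi_s.$$

The following result, which generalizes Theorem 8 in [2], plays a central role
in our proof.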

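Theorem 5. Let $\{\mathcal{F}_n : 1 \le n \le N\}$ be a finite collection of $[0, b]$-valued function
classes on a space $\mathcal{X}$, and $\mu$ a probability measure on $\mathcal{X}$. Then $\forall \delta \in (0, 1)$ we
have with probability at least $1 - \delta$ that

$$\max_{n \le N} \sup_{f \in \mathcal{F}_n} \left[ \mathbb{E}_{x \sim \mu} f(x) - \frac{1}{m} \sum_{i=1}^m f(x_i) \right] \le \max_{n \le N} \mathcal{R}_m(\mathcal{F}_n, \mu) + b \sqrt{\frac{\ln N + \ln(1/\delta)}{2m}}.$$

Proof. Denote with $\Psi_n$ the function on $\mathcal{X}^m$ defined by

$$\Psi_n(\mathbf{x}) = \sup_{f \in \mathcal{F}_n} \left[ \mathbb{E}_{x \sim \mu} f(x) - \frac{1}{m} \sum_{i=1}^m f(x_i) \right], \quad \mathbf{x} \in \mathcal{X}^m.$$

By standard symmetrization (see, for example, [18]) we have
$\mathbb{E}_{\mathbf{x} \sim \mu^m} \Psi_n(\mathbf{x}) \le \mathcal{R}_m(\mathcal{F}_n, \mu) \le \max_{n \le N} \mathcal{R}_m(\mathcal{F}_n, \mu)$. Modifying one of the $x_i$ can change the value
of any $\Psi_n(\mathbf{x})$ by at most $b/m$, so that by a union bound and the bounded
difference inequality (Theorem 3)

$$\Pr\left\{ \max_{n \le N} \Psi_n > \max_{n \le N} \mathcal{R}_m(\mathcal{F}_n, \mu) + t \right\} \le \sum_{n=1}^N \Pr\{ \Psi_n > \mathbb{E} \Psi_n + t \} \le N e^{-2m(t/b)^2}.$$

Solving $\delta = N e^{-2m(t/b)^2}$ for $t$ gives the result.

□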
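The last step of the proof, solving $\delta = N e^{-2m(t/b)^2}$ for $t$, can be checked numerically. This is a minimal sketch; the function name `union_bound_deviation` is hypothetical.

```python
import math


def union_bound_deviation(b: float, N: int, m: int, delta: float) -> float:
    """Solve delta = N * exp(-2 * m * (t / b)**2) for t.

    This is the deviation term b * sqrt((ln N + ln(1/delta)) / (2m))
    appearing in Theorem 5.
    """
    return b * math.sqrt((math.log(N) + math.log(1.0 / delta)) / (2.0 * m))


# Substituting t back recovers delta up to rounding.
t = union_bound_deviation(b=4.0, N=10, m=1000, delta=0.05)
assert abs(10 * math.exp(-2 * 1000 * (t / 4.0) ** 2) - 0.05) < 1e-12
```

The price of the union over $N$ classes is only the additive $\ln N$ under the square root, which is what makes the covering argument used later affordable.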



Notice that replacing the functions $f \in \mathcal{F}_n$ by $b - f$ does not affect the
Rademacher complexities, so the above result can be used in a two-sided way.
The following lemma was used in Section 2.3.



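Lemma 2. Suppose $\|x\| \le 1$, $\|c_k\| = 1$, $\langle c_k, c_l \rangle \ge 0$, $y \in \mathbb{R}^K$, $y_k \ge 0$. If $y$
minimizes

$$h(y) = \Big\| x - \sum_{k=1}^K y_k c_k \Big\|^2$$

then $\|y\| \le 1$.

Proof. Assume that $y$ is a minimizer of $h$ and $\|y\| > 1$. Then

$$\Big\| \sum_{k=1}^K y_k c_k \Big\|^2 = \|y\|^2 + \sum_{k \ne l} y_k y_l \langle c_k, c_l \rangle > 1.$$

Let the real-valued function $f$ be defined by $f(t) = h(ty)$. Then

$$f'(1) = 2 \left( \Big\| \sum_{k=1}^K y_k c_k \Big\|^2 - \Big\langle x, \sum_{k=1}^K y_k c_k \Big\rangle \right) \ge 2 \left( \Big\| \sum_{k=1}^K y_k c_k \Big\|^2 - \Big\| \sum_{k=1}^K y_k c_k \Big\| \right) = 2 \left( \Big\| \sum_{k=1}^K y_k c_k \Big\| - 1 \right) \Big\| \sum_{k=1}^K y_k c_k \Big\| > 0.$$

So $f$ cannot have a minimum at 1, whence $y$ cannot be a minimizer of $h$. □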
3.2 Proof of the main results 

We now fix a spanning codebook $Y \subseteq \mathbb{R}^K$ and recall that, for $T \in \mathcal{L}(\mathbb{R}^K, H)$,
we had introduced the notation

$$f_T(x) = \inf_{y \in Y} \|x - Ty\|^2, \quad x \in H.$$

Our principal object of study is the function class

$$\mathcal{F} = \{ f_T : T \in \mathcal{T} \},$$

where $\mathcal{T} \subseteq \mathcal{L}(\mathbb{R}^K, H)$ is some fixed set of candidate implementations of our
coding scheme. We first address the rather general Theorem 1 which can be
treated in parallel to the case of $K$-means clustering. We begin with a technical
lemma.

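Lemma 3. Suppose that

1. $(e_k : 1 \le k \le K)$ is an orthonormal basis of $\mathbb{R}^K$;

2. $\mathcal{T}$ is the class of linear operators $T : \mathbb{R}^K \to H$ with $\|Te_k\| \le c$;

3. $(x_i : 1 \le i \le m)$ is a sequence $x_i \in H$, $\|x_i\| \le 1$;

4. $(\gamma_{ik} : 1 \le i \le m, 1 \le k \le K)$ and $(\gamma_{ikl} : 1 \le i \le m, 1 \le k, l \le K)$ are
orthogaussian sequences.

Then the following three inequalities hold

$$\mathbb{E}_\gamma \sup_{T \in \mathcal{T}} \sum_{i=1}^m \sum_{k=1}^K \gamma_{ik} \langle x_i, Te_k \rangle \le c K \sqrt{m},$$

$$\mathbb{E}_\gamma \sup_{T \in \mathcal{T}} \sum_{i=1}^m \sum_{k=1}^K \gamma_{ik} \|Te_k\|^2 \le c^2 K \sqrt{m},$$

$$\mathbb{E}_\gamma \sup_{T \in \mathcal{T}} \sum_{i=1}^m \sum_{k,l=1}^K \gamma_{ikl} \langle Te_k, Te_l \rangle \le c^2 K^2 \sqrt{m}.$$

Proof. Using Cauchy-Schwarz' and Jensen's inequalities and the orthogaussian
properties of the $\gamma_{ik}$, we get

$$\mathbb{E}_\gamma \sup_{T \in \mathcal{T}} \sum_{k=1}^K \Big\langle \sum_{i=1}^m \gamma_{ik} x_i, Te_k \Big\rangle \le c\, \mathbb{E}_\gamma \sum_{k=1}^K \Big\| \sum_{i=1}^m \gamma_{ik} x_i \Big\| \le c K \sqrt{m},$$

which is the first inequality. Similarly we obtain

$$\mathbb{E}_\gamma \sup_{T \in \mathcal{T}} \sum_{k=1}^K \sum_{i=1}^m \gamma_{ik} \|Te_k\|^2 \le c^2\, \mathbb{E}_\gamma \sum_{k=1}^K \Big| \sum_{i=1}^m \gamma_{ik} \Big| \le c^2 K \sqrt{m}$$

and

$$\mathbb{E}_\gamma \sup_{T \in \mathcal{T}} \sum_{k,l=1}^K \sum_{i=1}^m \gamma_{ikl} \langle Te_k, Te_l \rangle \le c^2\, \mathbb{E}_\gamma \sum_{k,l=1}^K \Big| \sum_{i=1}^m \gamma_{ikl} \Big| \le c^2 K^2 \sqrt{m}.$$

□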



Proposition 2. Suppose that the probability measure $\mu$ is supported on the unit
ball of $H$, that $\{e_k : 1 \le k \le K\}$ is an orthonormal basis of $\mathbb{R}^K$ and that $\mathcal{T}$ is
a class of linear operators $T : \mathbb{R}^K \to H$ with $\|Te_k\| \le c$ for $1 \le k \le K$, with
$c \ge 1$. Let $Y$ be a nonempty closed subset of the unit ball in $\mathbb{R}^K$ and

$$\mathcal{F}_Y = \Big\{ x \in H \mapsto \min_{y \in Y} \|x - Ty\|^2 : T \in \mathcal{T} \Big\}.$$

Then

V m 

and if $Y = \{e_k : 1 \le k \le K\}$ then the bound improves to

$$\mathcal{R}_m(\mathcal{F}_Y, \mu) \le c^2 K \sqrt{\frac{18\pi}{m}}.$$

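Proof. By Lemma 1 it suffices to bound the corresponding Gaussian averages,
which we shall do using Slepian's Lemma (Theorem 4). First fix a sample $\mathbf{x} = (x_1, \dots, x_m)$ and
define Gaussian processes $\Omega$ and $\Xi$ indexed by $\mathcal{T}$

$$\Omega_T = \sum_i \gamma_i \min_{y \in Y} \|x_i - Ty\|^2 \quad \text{and}$$

$$\Xi_T = \sqrt{8} \sum_{i,k} \gamma_{ik} \langle x_i, Te_k \rangle + \sqrt{2} \sum_{i,l,k} \gamma_{ilk} \langle Te_l, Te_k \rangle.$$

Suppose $T_1, T_2 \in \mathcal{T}$. For any $x \in H$ we have, using $(a + b)^2 \le 2a^2 + 2b^2$ and
Cauchy-Schwarz

$$\Big( \min_{y \in Y} \|x - T_1 y\|^2 - \min_{y \in Y} \|x - T_2 y\|^2 \Big)^2$$

$$\le \max_{y \in Y} \Big( \|x - T_1 y\|^2 - \|x - T_2 y\|^2 \Big)^2$$

$$\le 8 \max_{y \in Y} \Big( \sum_k y_k \langle x, (T_1 - T_2) e_k \rangle \Big)^2 + 2 \max_{y \in Y} \Big( \sum_{k,l} y_k y_l \langle e_k, (T_1^* T_1 - T_2^* T_2) e_l \rangle \Big)^2$$

$$\le 8 \sum_k \big( \langle x, T_1 e_k \rangle - \langle x, T_2 e_k \rangle \big)^2 + 2 \sum_{k,l} \big( \langle T_1 e_k, T_1 e_l \rangle - \langle T_2 e_k, T_2 e_l \rangle \big)^2.$$

We therefore have

$$\mathbb{E} (\Omega_{T_1} - \Omega_{T_2})^2 = \sum_i \Big( \min_{y \in Y} \|x_i - T_1 y\|^2 - \min_{y \in Y} \|x_i - T_2 y\|^2 \Big)^2$$

$$\le 8 \sum_{i,k} \big( \langle x_i, T_1 e_k \rangle - \langle x_i, T_2 e_k \rangle \big)^2 + 2 \sum_{i,k,l} \big( \langle T_1 e_k, T_1 e_l \rangle - \langle T_2 e_k, T_2 e_l \rangle \big)^2 = \mathbb{E} (\Xi_{T_1} - \Xi_{T_2})^2.$$

So, by Slepian's Lemma and the first and last inequalities in Lemma 3

$$\mathbb{E} \sup_{T \in \mathcal{T}} \Omega_T \le \mathbb{E} \sup_{T \in \mathcal{T}} \Xi_T \le \sqrt{8}\, \mathbb{E} \sup_{T \in \mathcal{T}} \sum_{i,k} \gamma_{ik} \langle x_i, Te_k \rangle + \sqrt{2}\, \mathbb{E} \sup_{T \in \mathcal{T}} \sum_{i,l,k} \gamma_{ilk} \langle Te_l, Te_k \rangle.$$

Multiply by $\sqrt{2\pi}/m$ to get a bound on the Rademacher complexity of $\mathcal{F}_Y$.

To obtain the second conclusion we improve the bound on the Gaussian average.
With $\Omega_T$ as above we set

$$\Xi_T = \sum_{i=1}^m \sum_{k=1}^K \gamma_{ik} \|x_i - Te_k\|^2.$$

Now we have for $T_1, T_2 \in \mathcal{T}$ that

$$\mathbb{E} (\Omega_{T_1} - \Omega_{T_2})^2 = \sum_{i=1}^m \Big( \min_{k=1}^K \|x_i - T_1 e_k\|^2 - \min_{k=1}^K \|x_i - T_2 e_k\|^2 \Big)^2$$

$$\le \sum_{i=1}^m \max_{k=1}^K \Big( \|x_i - T_1 e_k\|^2 - \|x_i - T_2 e_k\|^2 \Big)^2$$

$$\le \sum_{i=1}^m \sum_{k=1}^K \Big( \|x_i - T_1 e_k\|^2 - \|x_i - T_2 e_k\|^2 \Big)^2 = \mathbb{E} (\Xi_{T_1} - \Xi_{T_2})^2.$$

Again with Slepian's Lemma and the triangle inequality

$$\mathbb{E}_\gamma \sup_{T \in \mathcal{T}} \Omega_T \le \mathbb{E}_\gamma \sup_{T \in \mathcal{T}} \Xi_T = \mathbb{E}_\gamma \sup_{T \in \mathcal{T}} \sum_{i=1}^m \sum_{k=1}^K \gamma_{ik} \|x_i - Te_k\|^2$$

$$\le 2\, \mathbb{E}_\gamma \sup_{T \in \mathcal{T}} \sum_{i=1}^m \sum_{k=1}^K \gamma_{ik} \langle x_i, Te_k \rangle + \mathbb{E}_\gamma \sup_{T \in \mathcal{T}} \sum_{i=1}^m \sum_{k=1}^K \gamma_{ik} \|Te_k\|^2$$

$$\le 2 c K \sqrt{m} + c^2 K \sqrt{m} \le 3 c^2 K \sqrt{m},$$

where the last inequality follows from the first two inequalities in Lemma 3.
Multiply by $\sqrt{2\pi}/m$ as above. □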

Theorem 1 follows from observing that the functions in $\mathcal{F}$ map to $[0, 4c^2]$
and combining the above bound on the Rademacher complexity with Theorem 5
with $N = 1$ and $b = 4c^2$.

The second conclusion of the proposition yields a bound for $K$-means clustering,
corresponding to the choices $Y = \{e_1, \dots, e_K\}$ and $\mathcal{T} = \{T : \|Te_k\| \le 1,\ 1 \le k \le K\}$.
As already noted in Section 2.2 the vectors $Te_k$ define the cluster centers. With
Theorem 5 we obtain

Theorem 6. For every $\delta > 0$ with probability greater than $1 - \delta$ in the sample $\mathbf{x} \sim \mu^m$
we have for all $T \in \mathcal{T}$

$$\mathbb{E}_{x \sim \mu} \min_{k=1}^K \|x - Te_k\|^2 \le \frac{1}{m} \sum_{i=1}^m \min_{k=1}^K \|x_i - Te_k\|^2 + K \sqrt{\frac{18\pi}{m}} + \sqrt{\frac{8 \ln(1/\delta)}{m}}.$$
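The gap between expected and empirical reconstruction error allowed by Theorem 6 can be evaluated directly. The sketch below assumes the slack term reads $K\sqrt{18\pi/m} + \sqrt{8\ln(1/\delta)/m}$; the function name `kmeans_slack` is hypothetical.

```python
import math


def kmeans_slack(K: int, m: int, delta: float) -> float:
    """Slack term K * sqrt(18*pi/m) + sqrt(8*ln(1/delta)/m) of the K-means bound."""
    if not 0 < delta < 1:
        raise ValueError("delta must lie in (0, 1)")
    return K * math.sqrt(18 * math.pi / m) + math.sqrt(8 * math.log(1 / delta) / m)


# The slack decays like 1/sqrt(m): a 100-fold larger sample shrinks it 10-fold.
ratio = kmeans_slack(5, 1_000, 0.05) / kmeans_slack(5, 100_000, 0.05)
assert abs(ratio - 10.0) < 1e-9
```

Since the reconstruction errors take values in $[0, 4]$, the bound carries information only when this slack is well below 4.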



To prove Theorem 2 a more subtle approach is necessary. The idea is the
following: every implementing map $T \in \mathcal{T}$ can be factored as $T = US$, where $S$
is a $K \times K$ matrix, $S \in \mathcal{L}(\mathbb{R}^K)$, and $U$ is an isometry, $U \in \mathcal{U}(\mathbb{R}^K, H)$. Suitably
bounded $K \times K$ matrices form a compact, finite dimensional set, the complexity
of which can be controlled using covering numbers, while the complexity arising
from the set of isometries can be controlled with Rademacher and Gaussian
averages. Theorem 5 then combines these complexity estimates.

For fixed $S \in \mathcal{L}(\mathbb{R}^K)$ we denote

$$\mathcal{G}_S = \{ f_{US} : U \in \mathcal{U}(\mathbb{R}^K, H) \}.$$

Recall the notation $\|\mathcal{T}\|_Y = \sup_{T \in \mathcal{T}} \|T\|_Y = \sup_{T \in \mathcal{T}} \sup_{y \in Y} \|Ty\|$. With $\mathcal{S}$ we
denote the set of $K \times K$ matrices

$$\mathcal{S} = \{ S \in \mathcal{L}(\mathbb{R}^K) : \|S\|_Y \le \|\mathcal{T}\|_Y \}.$$

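Lemma 4. Assume $\|\mathcal{T}\|_Y \ge 1$, that the functions in $\mathcal{F}$, when restricted to the
unit ball of $H$, have range contained in $[0, b]$, and that the measure $\mu$ is supported
on the unit ball of $H$. Then with probability at least $1 - \delta$ we have for all $T \in \mathcal{T}$
that

$$\mathbb{E}_{x \sim \mu} f_T(x) - \frac{1}{m} \sum_{i=1}^m f_T(x_i) \le \sup_{S \in \mathcal{S}} \mathcal{R}_m(\mathcal{G}_S, \mu) + \frac{bK}{2} \sqrt{\frac{\ln\left(16 m \|\mathcal{T}\|_Y^2\right)}{m}} + \frac{8 \|\mathcal{T}\|_Y}{\sqrt{m}} + b \sqrt{\frac{\ln(1/\delta)}{2m}}.$$

Proof. Fix $\epsilon > 0$. The set $\mathcal{S}$ is the ball of radius $\|\mathcal{T}\|_Y$ in the $K^2$-dimensional
Banach space $(\mathcal{L}(\mathbb{R}^K), \|\cdot\|_Y)$ so by Proposition 1 we can find a subset $\mathcal{S}_\epsilon \subseteq \mathcal{S}$,
of cardinality $|\mathcal{S}_\epsilon| \le (4 \|\mathcal{T}\|_Y / \epsilon)^{K^2}$ such that every member of $\mathcal{S}$ can be
approximated by a member of $\mathcal{S}_\epsilon$ up to distance $\epsilon$ in the norm $\|\cdot\|_Y$.

We claim that for all $T \in \mathcal{T}$ there exist $U \in \mathcal{U}(\mathbb{R}^K, H)$ and $S_\epsilon \in \mathcal{S}_\epsilon$ such that

$$|f_T(x) - f_{U S_\epsilon}(x)| \le 4 \|\mathcal{T}\|_Y \epsilon,$$

for all $x$ in the unit ball of $H$. To see this write $T = US$ with $U \in \mathcal{U}(\mathbb{R}^K, H)$
and $S \in \mathcal{L}(\mathbb{R}^K)$. Then, since $U$ is an isometry, we have

$$\|S\|_Y = \sup_{y \in Y} \|Sy\| = \sup_{y \in Y} \|Ty\| = \|T\|_Y \le \|\mathcal{T}\|_Y$$

so that $S \in \mathcal{S}$. We can therefore choose $S_\epsilon \in \mathcal{S}_\epsilon$ such that $\|S - S_\epsilon\|_Y \le \epsilon$. Then
for $x \in H$, with $\|x\| \le 1$, we have

$$|f_T(x) - f_{U S_\epsilon}(x)| = \Big| \inf_{y \in Y} \|x - USy\|^2 - \inf_{y \in Y} \|x - U S_\epsilon y\|^2 \Big|$$

$$\le \sup_{y \in Y} \Big| \|x - USy\|^2 - \|x - U S_\epsilon y\|^2 \Big|$$

$$= \sup_{y \in Y} \big| \langle U S_\epsilon y - USy,\ 2x - (USy + U S_\epsilon y) \rangle \big|$$

$$\le (2 + 2 \|\mathcal{T}\|_Y) \sup_{y \in Y} \|(S_\epsilon - S) y\| \le 4 \|\mathcal{T}\|_Y \epsilon.$$

Apply Theorem 5 to the finite collection of function classes $\{\mathcal{G}_S : S \in \mathcal{S}_\epsilon\}$ to see
that with probability at least $1 - \delta$

$$\sup_{T \in \mathcal{T}} \left[ \mathbb{E}_{x \sim \mu} f_T(x) - \frac{1}{m} \sum_{i=1}^m f_T(x_i) \right]$$

$$\le \max_{S \in \mathcal{S}_\epsilon} \sup_{U \in \mathcal{U}(\mathbb{R}^K, H)} \left[ \mathbb{E}_{x \sim \mu} f_{US}(x) - \frac{1}{m} \sum_{i=1}^m f_{US}(x_i) \right] + 8 \|\mathcal{T}\|_Y \epsilon$$

$$\le \max_{S \in \mathcal{S}_\epsilon} \mathcal{R}_m(\mathcal{G}_S, \mu) + b \sqrt{\frac{\ln |\mathcal{S}_\epsilon| + \ln(1/\delta)}{2m}} + 8 \|\mathcal{T}\|_Y \epsilon$$

$$\le \sup_{S \in \mathcal{S}} \mathcal{R}_m(\mathcal{G}_S, \mu) + \frac{bK}{2} \sqrt{\frac{\ln\left(16 m \|\mathcal{T}\|_Y^2\right)}{m}} + \frac{8 \|\mathcal{T}\|_Y}{\sqrt{m}} + b \sqrt{\frac{\ln(1/\delta)}{2m}},$$

where the last line follows from the known bound on $|\mathcal{S}_\epsilon|$, subadditivity of the
square root and the choice $\epsilon = 1/\sqrt{m}$. □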
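The final step of the proof compresses a short computation, which can be written out as follows. This is a sketch using the covering bound $|\mathcal{S}_\epsilon| \le (4\|\mathcal{T}\|_Y/\epsilon)^{K^2}$ with $\epsilon = 1/\sqrt{m}$ and the inequality $\sqrt{a + b} \le \sqrt{a} + \sqrt{b}$.

```latex
\begin{align*}
\ln |\mathcal{S}_\epsilon|
 &\le K^2 \ln\!\big( 4 \|\mathcal{T}\|_Y \sqrt{m} \big)
  = \tfrac{K^2}{2} \ln\!\big( 16\, m\, \|\mathcal{T}\|_Y^2 \big), \\
b \sqrt{\frac{\ln |\mathcal{S}_\epsilon| + \ln(1/\delta)}{2m}}
 &\le b \sqrt{\frac{K^2 \ln(16 m \|\mathcal{T}\|_Y^2)}{4m}} + b \sqrt{\frac{\ln(1/\delta)}{2m}}
  = \frac{bK}{2} \sqrt{\frac{\ln(16 m \|\mathcal{T}\|_Y^2)}{m}} + b \sqrt{\frac{\ln(1/\delta)}{2m}} .
\end{align*}
```

The remaining term $8\|\mathcal{T}\|_Y \epsilon$ becomes $8\|\mathcal{T}\|_Y/\sqrt{m}$ for the same choice of $\epsilon$.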



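Remark 1. If $H$ is finite dimensional, say of dimension $d$, the above result may be improved to

$$\sup_{T \in \mathcal{T}} \left[ \mathbb{E}_{x \sim \mu} f_T(x) - \frac{1}{m} \sum_{i=1}^m f_T(x_i) \right] \le \frac{b}{2} \sqrt{\frac{dK \ln\left(16 m \|\mathcal{T}\|_Y^2\right)}{m}} + \frac{8 \|\mathcal{T}\|_Y}{\sqrt{m}} + b \sqrt{\frac{\ln(1/\delta)}{2m}}. \qquad (2)$$

To see this, follow the same lines as in Lemma 4 to note that

$$\sup_{T \in \mathcal{T}} \left[ \mathbb{E}_{x \sim \mu} f_T(x) - \frac{1}{m} \sum_{i=1}^m f_T(x_i) \right] \le \max_{T \in \mathcal{T}_\epsilon} \left[ \mathbb{E}_{x \sim \mu} f_T(x) - \frac{1}{m} \sum_{i=1}^m f_T(x_i) \right] + 8 \|\mathcal{T}\|_Y \epsilon,$$

where $\mathcal{T}_\epsilon$ is a subset of $\mathcal{T}$ such that every member of $\mathcal{T}$ can be approximated by
a member of $\mathcal{T}_\epsilon$ up to distance $\epsilon$ in the norm $\|\cdot\|_Y$.

By Proposition 1, $|\mathcal{T}_\epsilon| \le (4 \|\mathcal{T}\|_Y / \epsilon)^{dK}$. Inequality (2) now follows from
Theorem 5 with $N = |\mathcal{T}_\epsilon|$ and $\epsilon = 1/\sqrt{m}$.

To complete the proof of Theorem 2 we now fix some $S \in \mathcal{S}$ and focus on
the corresponding function class $\mathcal{G}_S$.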

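Lemma 5. For any $S \in \mathcal{L}(\mathbb{R}^K)$ we have

$$\mathcal{R}_m(\mathcal{G}_S, \mu) \le 2 \sqrt{2\pi}\, \|S\|_Y \frac{K}{\sqrt{m}}.$$

Proof. Let $\|x_i\| \le 1$ and define Gaussian processes $\Omega_U$ and $\Xi_U$ indexed by
$\mathcal{U}(\mathbb{R}^K, H)$

$$\Omega_U = \sum_{i=1}^m \gamma_i \inf_{y \in Y} \|x_i - USy\|^2,$$

$$\Xi_U = 2 \|S\|_Y \sum_{k=1}^K \sum_{i=1}^m \gamma_{ik} \langle x_i, U e_k \rangle,$$

where the $e_k$ are the canonical basis of $\mathbb{R}^K$. For $U_1, U_2 \in \mathcal{U}(\mathbb{R}^K, H)$ we have

$$\mathbb{E} (\Omega_{U_1} - \Omega_{U_2})^2 \le \sum_{i=1}^m \sup_{y \in Y} \Big( \|x_i - U_1 S y\|^2 - \|x_i - U_2 S y\|^2 \Big)^2$$

$$= \sum_{i=1}^m \sup_{y \in Y} 4 \langle x_i, (U_2 - U_1) S y \rangle^2$$

$$\le 4 \sum_{i=1}^m \sup_{y \in Y} \|U_1^* x_i - U_2^* x_i\|^2 \|Sy\|^2$$

$$\le 4 \|S\|_Y^2 \sum_{i=1}^m \sum_{k=1}^K \big( \langle x_i, U_1 e_k \rangle - \langle x_i, U_2 e_k \rangle \big)^2 = \mathbb{E} (\Xi_{U_1} - \Xi_{U_2})^2.$$

It follows from Lemma 1 and Slepian's lemma (Theorem 4) that

$$\mathcal{R}_m(\mathcal{G}_S, \mu) \le \mathbb{E}_{\mathbf{x} \sim \mu^m} \frac{2}{m} \sqrt{\frac{\pi}{2}}\, \mathbb{E}_\gamma \sup_U \Xi_U,$$

so the result follows from the following inequalities, using Cauchy-Schwarz' and
Jensen's inequality, the orthonormality of the $\gamma_{ik}$ and the fact that $\|x_i\| \le 1$ on
the support of $\mu$.

$$\mathbb{E}_\gamma \sup_U \Xi_U = 2 \|S\|_Y\, \mathbb{E} \sup_U \sum_{k=1}^K \Big\langle \sum_{i=1}^m \gamma_{ik} x_i, U e_k \Big\rangle$$

$$\le 2 \|S\|_Y \sum_{k=1}^K \mathbb{E} \Big\| \sum_{i=1}^m \gamma_{ik} x_i \Big\|$$

$$\le 2 \|S\|_Y K \sqrt{m}.$$

□

Substitution of the last result in Lemma 4 and noting that, for $K \ge 1$,
$2 \sqrt{2\pi} K + 8 \le 14 K$, gives Theorem 2.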

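Observe that when the set $\mathcal{S}$ contains only the identity matrix $I$, the function
class $\mathcal{G}_S$ is the class of reconstruction errors of PCA. In this case, the result can
be improved as shown by the next lemma.

Lemma 6. For $S = I$ we have $\mathcal{R}_m(\mathcal{G}_I, \mu) \le 2 \sqrt{K/m}$.

Proof. Recall, for every $z \in H$, that the outer product operator $Q_z$ is defined by
$Q_z x = \langle x, z \rangle z$. With $\langle \cdot, \cdot \rangle_2$ and $\|\cdot\|_2$ denoting the Hilbert-Schmidt inner product
and norm respectively we have for $\|x_i\| \le 1$

$$\mathbb{E}_\sigma \sup_{f \in \mathcal{G}_I} \sum_{i=1}^m \sigma_i f(x_i) = \mathbb{E}_\sigma \sup_{U} \sum_{i=1}^m \sigma_i \big( \|x_i\|^2 - \|U U^* x_i\|^2 \big)$$

$$= \mathbb{E}_\sigma \sup_{U} \Big\langle \sum_{i=1}^m \sigma_i Q_{x_i}, U U^* \Big\rangle_2$$

$$\le \mathbb{E}_\sigma \Big\| \sum_{i=1}^m \sigma_i Q_{x_i} \Big\|_2 \sup_{U} \|U U^*\|_2$$

$$\le \sqrt{mK},$$

since the Hilbert-Schmidt norm of a $K$-dimensional projection is $\sqrt{K}$. The result
follows upon multiplication with $2/m$ and taking the expectation in $\mu^m$. □

An application of Theorem 5 with $N = 1$ and $b = 1$ also gives a generalization
bound for PCA of order $\sqrt{K/m}$.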
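The fact used in the proof, that a $K$-dimensional orthogonal projection has Hilbert-Schmidt norm $\sqrt{K}$, can be checked on a small example. The sketch assumes $H = \mathbb{R}^d$ and builds the projection $UU^*$ from a list of orthonormal vectors; the helper names are hypothetical.

```python
import math
from typing import List, Sequence


def projection_matrix(basis: Sequence[Sequence[float]]) -> List[List[float]]:
    """Return P = sum_k u_k u_k^T for the orthonormal vectors u_k in `basis`."""
    d = len(basis[0])
    return [[sum(u[i] * u[j] for u in basis) for j in range(d)] for i in range(d)]


def hilbert_schmidt_norm(matrix: Sequence[Sequence[float]]) -> float:
    """Frobenius norm: square root of the sum of squared entries."""
    return math.sqrt(sum(v * v for row in matrix for v in row))


# A 2-dimensional projection in R^3 has Hilbert-Schmidt norm sqrt(2).
s = 1.0 / math.sqrt(2.0)
P = projection_matrix([[1.0, 0.0, 0.0], [0.0, s, s]])
assert abs(hilbert_schmidt_norm(P) - math.sqrt(2.0)) < 1e-9
```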



4 Concluding remarks 

We have analyzed a general method to encode random vectors in a Hilbert space
$H$. The method searches for an operator $T : \mathbb{R}^K \to H$ which minimizes, within
some prescribed class $\mathcal{T}$, the empirical average of the reconstruction error, which
is defined as the minimum squared distance between a given point in $H$ and an
image of the operator $T$ acting on a prescribed codebook $Y$.

We have presented two approaches to upper bound the estimation error of the
method in terms of the parameter $K$, the sample size $m$ and the properties of the
sets $\mathcal{T}$ and $Y$. The first approach is based on a direct bound for the Rademacher
average of the loss class induced by the reconstruction error. The bound matches
the best known bound for $K$-means clustering in a Hilbert space [4] but also
applies to other interesting coding techniques such as sparse coding and
nonnegative matrix factorization. The second approach uses a decomposition of the
function class as a union of function classes parameterized by $K$-dimensional
isometries. The main idea is to approximate the union with a finite union via
covering numbers and then bound the complexity of each class under the union
with Rademacher averages. This second result is more complicated than the first
one; however, in certain cases it provides a better dependency of the bound on
the parameter $K$ at the expense of an additional logarithmic factor in $m$.

We conclude with some open problems and possible extensions which are
suggested by this study. Firstly, it would be valuable to investigate the possibility
of removing the logarithmic term in $m$ in the bound of Theorem 2. Secondly, it
would be important to elucidate whether the dependency on $K$ in the same bound
is optimal. The latter problem is also mentioned in [4] in the case of $K$-means
clustering. Finally, it would be interesting to study possible improvements of our
results in the case that additional assumptions on the probability measure are
introduced. For example, in the case of $K$-means clustering in a finite dimensional
Hilbert space [1] shows that for certain classes of probability measures the rate of
convergence can be improved to $O(\log(m)/m)$ and it may be possible to obtain
similar improvements in our general framework.